
Rank Methods #1733

Merged: 8 commits into pydata:master from 0x0L's rank_ffill branch, Dec 18, 2017

Conversation

0x0L
Contributor

@0x0L 0x0L commented Nov 21, 2017

  • Closes Rank function #1731
  • Tests added / passed
  • Passes git diff upstream/master **/*py | flake8 --diff
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

@0x0L 0x0L force-pushed the rank_ffill branch 6 times, most recently from a00b098 to c8997c7 Compare November 21, 2017 22:39
@0x0L 0x0L changed the title [WIP] initial support for rank and ffill [WIP] initial support for rank Nov 21, 2017
@shoyer
Member

shoyer commented Nov 23, 2017

Thanks for putting this together!

Rather than using the injection stuff in ops.py (which we would like to eliminate), I would prefer to simply implement this as a method on Dataset and DataArray.

Also take a look at the use of apply_ufunc in #1640 for a straightforward example of wrapping external functions. I think that would make things easier here.
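The `apply_ufunc` pattern shoyer points to can be sketched roughly like this. This is a minimal illustration, not the PR's actual code: `_rank_last_axis` is a hypothetical plain-NumPy stand-in for `bottleneck.rankdata`, so the sketch has no optional dependency.

```python
import numpy as np
import xarray as xr

def _rank_last_axis(a):
    # 1-based ordinal ranks along the last axis; a NumPy stand-in
    # for bottleneck.rankdata.
    order = np.argsort(a, axis=-1)
    ranks = np.empty(a.shape, dtype=np.float64)
    np.put_along_axis(ranks, order,
                      np.arange(1, a.shape[-1] + 1, dtype=np.float64),
                      axis=-1)
    return ranks

def rank(da, dim):
    # apply_ufunc moves `dim` to the last axis (input_core_dims), applies
    # the wrapped function along it, and broadcasts over every other dim.
    return xr.apply_ufunc(
        _rank_last_axis, da,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
    )
```

For example, `rank(xr.DataArray([30., 10., 20.], dims='x'), 'x')` yields ranks `[3., 1., 2.]`.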

@@ -23,6 +23,8 @@
try:
import bottleneck as bn
has_bottleneck = True
# monkeypatch numpy with rankdata
np.rankdata = bn.rankdata
Member

Please don't monkeypatch -- this would affect NumPy for all xarray users.

@0x0L
Contributor Author

0x0L commented Nov 24, 2017

@shoyer Thanks for the comment and the tips! I'll make the changes asap, hopefully this week-end

@0x0L
Contributor Author

0x0L commented Nov 24, 2017

Dropped support for Dataset. I don't think there's much use for it anyway.

@0x0L 0x0L force-pushed the rank_ffill branch 2 times, most recently from f9c0bc5 to 94ac0ca Compare November 29, 2017 22:52
array([ 1., 2., 3.])
Dimensions without coordinates: x
"""
import bottleneck as bn
Contributor Author

IMO the exception is explicit enough

ranks = apply_ufunc(func, self,
dask='parallelized',
keep_attrs=True,
output_dtypes=[np.float_],
Member

Based on the docs, it looks like the return type is always np.float64.

Ranks begin at 1, not 0. If pct is True, computes percentage ranks.

NaNs in the input array are returned as NaNs.

Member

Add a note here that this requires bottleneck.

# str
y = DataArray(['c', 'b', 'a'])
self.assertDataArrayEqual(y.rank('dim_0'), x)

Member

need to add test coverage for the pct=True option.

Member

@shoyer shoyer left a comment

This looks great -- can you take a look at implementing this for Dataset, too? For consistency, it's nice if we have the same methods for both Dataset/DataArray if possible. I think this should be a pretty straightforward use of apply:

def rank(self, dim, pct=False):
    return self.apply(lambda array: array.rank(dim, pct=pct))

@0x0L
Contributor Author

0x0L commented Dec 4, 2017

@shoyer when the ranking dimension is missing from an array in the dataset, should we do nothing, or put 1s everywhere? Neither option looks appealing or natural.

@shoyer
Member

shoyer commented Dec 4, 2017

when the ranking dimension is missing from an array in the dataset, should we do nothing, or put 1s everywhere?

Good point, maybe it needs to be slightly more complicated than apply.

Two other options are to raise an error (safest) or to drop all such variables from the result. The latter is similar to what we do in aggregations like sum() when an argument is non-numeric. I think this is probably the best choice: it strikes a good balance between not returning an invalid result and not raising errors so often as to be annoying.

@0x0L
Contributor Author

0x0L commented Dec 4, 2017

Should I move the implementation to Variable and use temporary datasets like quantile does?

@shoyer
Member

shoyer commented Dec 4, 2017

Should I move the implementation to Variable and use temporary datasets like quantile does?

I would be happy either way here. In some ways, yes, that would be the cleanest solution.

kwargs=dict(axis=axis)).transpose(*self.dims)
if not pct:
return ranks
return (ranks - 1) / (self.count(dim) - 1)
Contributor Author

@shoyer with pct=True this is not the same implementation as pandas, which returns rank / count. I think this one makes more sense, but users might be surprised.

Member

Good catch, thanks for calling this out. The pandas definition makes a little more sense to me, as "fraction of the data with this rank or higher". I don't know how I would describe your version in a simple sentence.

Contributor Author

@0x0L 0x0L Dec 5, 2017

Normalized? It lies in [0, 1], not [1/Nsamples, 1].

My main issue is that I would expect the pct rank to average to 0.5, but with pandas we have

In [2]: pd.Series([1, 2, 3]).rank(pct=True)
Out[2]: 
0    0.333333
1    0.666667
2    1.000000

On the other hand, with my implementation, pct rank is not defined for a dim of length 1.

For the record, bottleneck's move_rank returns a polar rank in [-1, 1].
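For concreteness, the three normalizations under discussion differ only in an affine rescaling of the ordinal ranks. A small numeric sketch (variable names are illustrative, not from the PR):

```python
import numpy as np

ranks = np.array([1.0, 2.0, 3.0])        # ordinal ranks of a length-3 series
n = ranks.size

pandas_pct = ranks / n                   # pandas: in (0, 1], here [1/3, 2/3, 1]
unit_norm = (ranks - 1) / (n - 1)        # this PR's draft: in [0, 1], mean 0.5
polar = 2 * (ranks - 1) / (n - 1) - 1    # bottleneck move_rank style: in [-1, 1]
```

Note that `unit_norm` and `polar` are undefined when n == 1, which is the drawback 0x0L mentions.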

Member

My main issue is that I would expect the pct rank to average to 0.5

I agree that this would be a nice property, but I'm not sure it's worth the loss in interpretability.

Contributor Author

How about a str arg for specifying the normalization scheme? 'pct', 'norm', and 'polar'.

Member

What would polar mean?

All things being equal, we do try to lean towards doing what pandas does in xarray because consistency makes it easier to use both packages together.

Contributor Author

polar would be [-1, 1].
I get your point of view; I'll make the change to mimic pandas.

----------
dim : str
pct : bool, optional
keep_attrs : bool, optional
Member

One last comment: it would be great if you could expand the parameters section of the docstrings to include a one-line description of each. For example:

         Parameters
         ----------
         dim : str
             Dimension(s) over which to compute rank.
         pct : bool, optional
             If True, compute percentage ranks, otherwise compute integer ranks.
         keep_attrs : bool, optional
             If True, the dataset's attributes (`attrs`) will be copied from
             the original object to the new one.  If False (default), the new
             object will be returned without attributes.

@jhamman jhamman changed the title [WIP] initial support for rank Rank Methods Dec 7, 2017
@max-sixty
Collaborator

This looks great @0x0L, thanks and congrats on your first contribution!

# int
v = Variable(['x'], [3,2,1])
expect = bn.rankdata(v.data, axis=0)
np.testing.assert_allclose(v.rank('x').values, expect)
Collaborator

@max-sixty max-sixty Dec 8, 2017

FYI for the future, xarray.testing has these natively, e.g. assert_equal rather than the older self. methods
(tbc, no need to change)

Contributor Author

A shameless copy/paste from the test above :)

Member

@shoyer shoyer left a comment

A couple of minor points.

for name, var in iteritems(self.variables):
if name in self.data_vars and dim in var.dims:
variables[name] = var.rank(dim, pct=pct)
variables.update({
Member

This looks a little complicated to me, and also will drop non-dimension coordinates. As a simpler rule, how about simply keeping all existing coordinates?

This could look like:

if name in self.data_vars:
    if dim in var.dims:
        variables[name] = var.rank(dim, pct=pct)
else:
    variables[name] = var
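The rule shoyer suggests above (rank the data variables that depend on `dim`, drop the rest, keep all coordinates) can be sketched as a free function. This is an illustration only, not the PR's code: `_ordinal_rank` is a hypothetical plain-NumPy stand-in for the bottleneck-backed `Variable.rank`, and the pct branch uses the pandas-style rank/count normalization agreed on above (ignoring NaNs for simplicity).

```python
import numpy as np
import xarray as xr

def _ordinal_rank(da, dim):
    # 1-based ordinal ranks of a DataArray along `dim` (NumPy stand-in).
    axis = da.dims.index(dim)
    a = np.moveaxis(da.values, axis, -1)
    order = np.argsort(a, axis=-1)
    ranks = np.empty(a.shape, dtype=np.float64)
    np.put_along_axis(ranks, order,
                      np.arange(1, a.shape[-1] + 1, dtype=np.float64),
                      axis=-1)
    return da.copy(data=np.moveaxis(ranks, -1, axis))

def dataset_rank(ds, dim, pct=False):
    ranked = {}
    for name, da in ds.data_vars.items():
        if dim in da.dims:  # rank only variables that depend on `dim`
            r = _ordinal_rank(da, dim)
            ranked[name] = r / r.sizes[dim] if pct else r
        # data variables without `dim` are silently dropped, as with sum()
    # passing ds.coords keeps all coordinates, including non-dimension ones
    return xr.Dataset(ranked, coords=ds.coords, attrs=ds.attrs)
```

With a dataset holding `a` on dim `x` and `b` on dim `y`, `dataset_rank(ds, 'x')` ranks `a`, drops `b`, and keeps the coordinates.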

Member

It would also be good to add test cases verifying:

  1. that coordinates stick around
  2. that invalid data variables are dropped

import bottleneck as bn

if isinstance(self.data, dask_array_type):
raise TypeError("rank does not work for arrays stored as dask "
Member

please add a test that this error is raised, e.g., using pytest.raises or raises_regex
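As a generic illustration of the pytest.raises pattern requested here (`fake_rank` is a hypothetical stand-in, not the PR's code, since exercising the real check would need a dask-backed array):

```python
import pytest

def fake_rank(data, use_dask=False):
    # hypothetical stand-in: the real Variable.rank raises TypeError
    # when its underlying data is a dask array
    if use_dask:
        raise TypeError("rank does not work for arrays stored as dask arrays")
    return [sorted(data).index(v) + 1 for v in data]  # 1-based ordinal ranks

def test_rank_raises_on_dask():
    # pytest.raises(..., match=...) checks both the exception type
    # and (via regex) the error message, like xarray's raises_regex helper
    with pytest.raises(TypeError, match="dask"):
        fake_rank([3, 1, 2], use_dask=True)
```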

Variables that do not depend on `dim` are dropped.
"""
if dim not in self.dims:
raise ValueError('Dataset does not contain the dimension: %s' % dim)
Member

please add test coverage for this condition

Member

@shoyer shoyer left a comment

Looks good to me! I'll merge this in a day or two unless anyone has other comments/suggestions.

else:
variables[name] = var

coord_names = set(k for k in self.coords if k in variables)
Member

could be simplified as just coord_names = set(self.coords) now

@shoyer shoyer merged commit a0ef2b7 into pydata:master Dec 18, 2017
@shoyer
Member

shoyer commented Dec 18, 2017

Thank you!
